Data wrangling exercise

Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis.
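As a minimal, hypothetical sketch of that cycle (using an in-memory toy data set rather than a file — the names and values here are invented for illustration), the steps might look like:

```r
# Toy example of the wrangling cycle: raw data in, tidy data out.
raw <- data.frame(name  = c("Ann", "Bob", NA),
                  score = c("10", "n/a", "7"),
                  stringsAsFactors = FALSE)

clean <- raw[!is.na(raw$name), ]                          # clean: drop rows with no name
clean$score <- suppressWarnings(as.numeric(clean$score))  # transform: "n/a" becomes NA
clean <- clean[!is.na(clean$score), ]                     # clean: drop unparseable scores
clean$score_pct <- clean$score / max(clean$score) * 100   # transform: derive a new column
```

The exercises below practice the first step of this cycle (import) and its mirror image (export) across the most common file formats.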

read.csv
read.table
read_excel
fromJSON
as.data.frame
xmlParse
xmlToList
write.csv
write.table
write.xlsx
toJSON
write
write.xml

Libraries:
readxl
xlsx
rjson
plyr
XML
kulife
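The exercises below assume these packages are installed and loaded. One way to set that up (a sketch, not part of the original exercises — note that kulife, which provides write.xml, may need to be installed from the CRAN archive if it is no longer on CRAN):

```r
# Install any missing packages, then load them all.
pkgs <- c("readxl", "xlsx", "rjson", "plyr", "XML", "kulife")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```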

Exercise 1
Import the data.csv file to the csv_file object.
csv_file <- read.csv("data.csv", header = TRUE)

Exercise 2
Import the data.txt file to the txt_file object.
txt_file <- read.table("data.txt")

Exercise 3
Import the data.xls file to the xls_file object.
xls_file <- read_excel("data.xls", sheet = 1)
#OR
xls_file <- read.xlsx("data.xls", sheetIndex = 1)

Exercise 4
Import the data.xlsx file to the xlsx_file object.
xlsx_file <- read_excel("data.xlsx", sheet = 1)
#OR
xlsx_file <- read.xlsx("data.xlsx", sheetIndex = 1)

Exercise 5
Import the data.json file to the json_file object.
json_file <- fromJSON(file="data.json")
# fromJSON() returns a list; you will probably want a data frame,
# so use the command below to convert it.
json_file <- as.data.frame(json_file)

Exercise 6
Import the data.xml file to the xml_file object.
xml_file <- ldply(xmlToList("data.xml"), data.frame)

Exercise 7
Export the csv_file as “data.csv” and txt_file as “data.txt”.
# row.names = FALSE avoids writing a spurious row-name column
write.csv(csv_file, file = "data.csv", row.names = FALSE)
#OR
write.table(csv_file, file = "data.csv", sep = ",", row.names = FALSE)
write.table(txt_file, file = "data.txt", sep = "\t")



Exercise 8
Export the xls_file as “data.xls” and xlsx_file as “data.xlsx”.
write.xlsx(xls_file, "data.xls")
write.xlsx(xlsx_file, "data.xlsx")


Exercise 9
Export the json_file as “data.json”.
data_json <- toJSON(json_file)
write(data_json, "data.json")


Exercise 10
Export the xml_file as “data.xml”.
write.xml(xml_file, file="data.xml")


Before proceeding, it might be helpful to look over the help pages for getURL, fromJSON, ldply, xmlToList, read_html, html_nodes, html_table, readHTMLTable, and htmltab.

Moreover, please install and load the following libraries.
install.packages("RCurl")
library(RCurl)
install.packages("rjson")
library(rjson)
install.packages("XML")
library(XML)
install.packages("plyr")
library(plyr)
install.packages("rvest")
library(rvest)
install.packages("htmltab")
library(htmltab)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Retrieve the source of the web page “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-1/data.csv” and assign it to the object “url”

####################
#                  #
#    Exercise 1    #
#                  #
####################

url <- getURL("https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-1/data.csv")

Exercise 2

Read the CSV text you just retrieved and assign the result to the “csv_file” object.

####################
#                  #
#    Exercise 2    #
#                  #
####################

csv_file <- read.csv(text = url)

Exercise 3

Do the same as in exercise 1, but with the URL “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.txt”, and then assign the result to the “txt_file” object.
Note: it is a txt file, so use the appropriate function to import it.

####################
#                  #
#    Exercise 3    #
#                  #
####################

url <- getURL("https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.txt")

txt_file <- read.table(text = url)

Exercise 4

Do the same as in exercise 1, but with the URL “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.json”, and then assign the result to the “json_file” object.
Note: it is a JSON file, so use the appropriate function to import it.

####################
#                  #
#    Exercise 4    #
#                  #
####################

url <- getURL("https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.json")

json_file <- fromJSON(url)

Exercise 5

Do the same as in exercise 1, but with the URL “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.xml”, and then assign the result to the “xml_file” object.
Note: it is an XML file, so use the appropriate function to import it.

####################
#                  #
#    Exercise 5    #
#                  #
####################

url <- getURL("https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.xml")
xml_file <- ldply(xmlToList(url), data.frame)

Exercise 6

We will now move on to web scraping. Read the HTML page “http://www.worldatlas.com/articles/largest-cities-in-europe-by-population.html” and assign it to the object “url”.
Hint: consider using read_html.

####################
#                  #
#    Exercise 6    #
#                  #
####################

url <- read_html("http://www.worldatlas.com/articles/largest-cities-in-europe-by-population.html")

Exercise 7

Select the “table” nodes from the HTML document you retrieved before.
Hint: consider using html_nodes.

####################
#                  #
#    Exercise 7    #
#                  #
####################

tbls <- html_nodes(url, "table")

Exercise 8

Convert the nodes you retrieved in exercise 7 into a list of data frames for processing.
Hint: consider using html_table.

####################
#                  #
#    Exercise 8    #
#                  #
####################

tbls_read <- url %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)

Exercise 9

Let’s try a faster and more straightforward function: retrieve the HTML document as in exercise 6 and turn its tables into a list using the function readHTMLTable.

####################
#                  #
#    Exercise 9    #
#                  #
####################

url <- "http://www.worldatlas.com/articles/largest-cities-in-europe-by-population.html"
tbls_xml <- readHTMLTable(url)

Exercise 10

This may be a bit tricky, but give it a try. Retrieve the HTML document as in exercise 6 and extract the table as a data frame using the function htmltab.

####################
#                  #
#    Exercise 10   #
#                  #
####################

df_pop <- htmltab(doc = url, which = "//th[text() = 'Rank']/ancestor::table")